Code is mostly excluded from the knit .html version of this notebook to maintain a clean presentation. It is included in a few places that make sense for demonstrative purposes. The full code is provided in the accompanying .Rmd file.
To run the .Rmd file, make sure the included dependencies Elasticsearch.R and elasticsearch_queries.R are in the same directory as the .Rmd file, and set “elasticsearch_host” to the appropriate value here (the value is omitted from the GitHub version for security reasons).
elasticsearch_host <- ""
Discuss the motivation of this Twitter study and what it aims to accomplish. Specifically, what research question is being investigated?
## The dataset
Provide details on the dataset being used in this study. Include details on how the dataset was collected and provide references/citations if applicable.
If using the Rensselaer IDEA COVID-TweetIDs dataset, it can be referenced here: Rensselaer IDEA COVID-19 Tweet Dataset
If using the dataset from the paper “Extracting COVID-19 Events from Twitter”, it can be referenced here: Extracting COVID-19 Events from Twitter
Discuss how the data is being used to answer the research question. Provide details on any statistical methods, aggregations, classification, clustering, etc. being used.
Here, run the code to query the dataset from the appropriate Elasticsearch index, execute the analysis, and visualize the results.
# query start date/time (inclusive)
rangestart <- "2020-01-01 00:00:00"
# query end date/time (exclusive)
rangeend <- "2020-09-01 00:00:00"
# query semantic similarity phrase
semantic_phrase <- ""
# return results in chronological order or as a random sample within the range
# (ignored if semantic_phrase is not blank)
random_sample <- FALSE
# number of results to return (max 10,000)
resultsize <- 10000
To find the optimal number of high-level theme clusters for this sample, an elbow plot is used:
The plot is mostly a smooth curve, although there is a distinct “elbow” point between k=8 and k=10. We will select k=8:
k <- 8
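The elbow computation described above can be sketched as follows. This is only an illustration under stated assumptions: the `embeddings` matrix is random stand-in data (not the tweet embeddings queried by this notebook), and base R's `kmeans` is assumed as the clustering method, which the notebook text does not confirm.

```r
# Hypothetical stand-in for the tweet embedding matrix (one row per tweet).
set.seed(42)
embeddings <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

# Total within-cluster sum of squares for each candidate k.
k_values <- 1:12
wss <- sapply(k_values, function(k) {
  kmeans(embeddings, centers = k, nstart = 5, iter.max = 30)$tot.withinss
})

# The "elbow" is the k after which the curve flattens noticeably.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On real embeddings the curve is read by eye, so the chosen k remains a judgment call rather than a computed optimum.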
To find the optimal number of topic subclusters for each theme cluster, another elbow plot is generated with a separate curve for each theme cluster. Since the within-cluster sums of squares can be on different scales for theme clusters of different sizes and levels of diversity, the withinss metric is scaled to zero mean and unit variance:
The curves for all theme clusters look similar and are again smooth. This time there is no clear “elbow” point, so any k between 8 and 15 is a reasonable choice. We will select cluster.k=8 for the topic subclusters:
cluster.k <- 8
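The per-cluster scaling described above can be sketched as follows. The data, the three theme clusters, and the use of `kmeans` are assumptions made for illustration; only the idea of standardizing each curve with `scale()` comes from the text.

```r
# Hypothetical embeddings and theme-cluster labels (stand-in data).
set.seed(1)
embeddings <- matrix(rnorm(300 * 4), nrow = 300)
theme <- kmeans(embeddings, centers = 3, nstart = 5, iter.max = 30)$cluster

k_values <- 2:10
# One column per theme cluster; each column is that cluster's elbow curve,
# standardized to zero mean and unit variance so the curves are comparable.
scaled_wss <- sapply(sort(unique(theme)), function(cl) {
  sub <- embeddings[theme == cl, , drop = FALSE]
  wss <- sapply(k_values, function(k) {
    kmeans(sub, centers = k, nstart = 5, iter.max = 30)$tot.withinss
  })
  as.numeric(scale(wss))
})

matplot(k_values, scaled_wss, type = "b", pch = 1, lty = 1,
        xlab = "Number of subclusters k",
        ylab = "Scaled within-cluster sum of squares")
```

Standardizing changes only the vertical position and stretch of each curve, not its shape, so the elbow location for each theme cluster is preserved.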
## [1] "Subclustering cluster 1 ..."
## [1] "Subclustering cluster 2 ..."
## [1] "Subclustering cluster 3 ..."
## [1] "Subclustering cluster 4 ..."
## [1] "Subclustering cluster 5 ..."
## [1] "Subclustering cluster 6 ..."
## [1] "Subclustering cluster 7 ..."
## [1] "Subclustering cluster 8 ..."
## [1] "Plotting cluster 1 ..."
## [1] "Plotting cluster 2 ..."
## [1] "Plotting cluster 3 ..."
## [1] "Plotting cluster 4 ..."
## [1] "Plotting cluster 5 ..."
## [1] "Plotting cluster 6 ..."
## [1] "Plotting cluster 7 ..."
## [1] "Plotting cluster 8 ..."
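The progress messages above suggest a loop over the theme clusters. A minimal sketch of such a loop is shown here, assuming `kmeans` for both levels and random stand-in data; the variable names `theme` and `subcluster` are hypothetical, and smaller values of `k` and `cluster.k` are used so the toy data stays meaningful.

```r
# Hypothetical stand-in data; the notebook uses k = 8 and cluster.k = 8.
set.seed(7)
embeddings <- matrix(rnorm(400 * 4), nrow = 400)
k <- 4
cluster.k <- 3

# First level: assign every row to a theme cluster.
theme <- kmeans(embeddings, centers = k, nstart = 5, iter.max = 30)$cluster

# Second level: cluster the rows of each theme cluster separately.
subcluster <- integer(nrow(embeddings))
for (cl in seq_len(k)) {
  print(paste("Subclustering cluster", cl, "..."))
  rows <- which(theme == cl)
  fit <- kmeans(embeddings[rows, , drop = FALSE],
                centers = cluster.k, nstart = 5, iter.max = 30)
  subcluster[rows] <- fit$cluster
}

# Cross-tabulate theme clusters against their topic subclusters.
print(table(theme, subcluster))
```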
Present an analysis of the results obtained by your methods. For example:
| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 3527 | 0.6989163 | “The problem is that we cannot cure #CoronaVirus, that’s why we need to mitigate it because there’s no vaccine for it” - @mattiaferraresi @robertmarawa #MSW #ReactionMonday | Johannesburg, South Africa |
| 461 | 0.6878533 | There Are No Breakthrough Treatments For Coronavirus, So Don’t Fall For Internet “Cures” #coronavirus #internet #health #healthy #healthyliving #healthylifestyle | Weston, Florida |
| 737 | 0.6773909 | Coronavirus: Though so-called naturopathic influencers on social media claim taking near-lethal doses of vitamin C is the cure for COVID-19, one expert says that vitamin C is unlikely to cure coronavirus. | Glasgow |
| 4599 | 0.6743958 | @melissadderosa Lets get these doctors and nurses the one thing they really need to combat this devastating virus! ENU200 #coronavirus cure! But we need to have support to get that cure to those who need it most and stop this virus in its tracks to save lives #ENU200curescovid19 | Global |
| 3529 | 0.6574262 | @AngelaBelcamino @realDonaldTrump The cure is remaining locked down for however long and waiting for the virus to die out which would in turn destroy the economy, the problem is COVID-19. | Compton, CA |

| | center_cosine_similarity | full_text | user_location |
|---|---|---|---|
| 1018 | 0.7961717 | "@ChidiOdinkalu: U can wear nose mask or wash ur hand to prevent against #CoronaVirus but nose mask does not prevent against #CoronaLeaders. The most deadly viruses are elected….! #CoronaVirusUpdate | EARTH, WORLD |
| 727 | 0.7456827 | Washing your hands is the better way to prevent the spread of germs! #COVID2019 #coronavirus #WashYourHands | TRUMP NATION |
| 3199 | 0.7398792 | The World Health Organisation is advising people to follow five simple steps to help prevent the spread of COVID-19: 1. Wash your hands 2. Cough/sneeze into your elbow 3. Don’t touch your face 4. Stay more than 3ft (1m) away from others 5. Stay home if you feel sick | Pakistan |
| 493 | 0.7219483 | If you want to prevent yourself from #Coronavirus, then use #Palmist Hand Sanitizers. #Sanitizers #SanitizationMaximization #coronavirusinindia #CoronaVirusUpdate #CoronaVirusOutbreak #HealthForAll #CoronaAlert | Delhi, India |
| 3573 | 0.7102517 | Everyone can help prevent the spread of COVID-19. One way is by washing your hands frequently with soap and water for at least 20 seconds. #DontBeASpreader #COVID19 #FlattenTheCurve #PinellasEM #PinellasMC | Pinellas County, Florida |
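The center_cosine_similarity column in the tables above ranks tweets by how close their embedding is to a cluster center. A minimal sketch of that ranking follows, assuming the center is the mean of the cluster's embeddings; the helper `cosine_similarity` and the random `embeddings` matrix are hypothetical and are not taken from the notebook's code.

```r
# Cosine similarity between two numeric vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical embeddings for the tweets in one cluster.
set.seed(3)
embeddings <- matrix(rnorm(50 * 4), nrow = 50)
center <- colMeans(embeddings)

# Similarity of every tweet to the center, then the five closest tweets.
sims <- apply(embeddings, 1, cosine_similarity, b = center)
top5 <- order(sims, decreasing = TRUE)[1:5]
print(data.frame(row = top5, center_cosine_similarity = sims[top5]))
```

Tweets with the highest similarity can be read as the most representative examples of a cluster, which is how the sample rows above are presented.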
Summarize the study and discuss key takeaways.
Discuss interesting directions for follow-up investigation.
Discuss any technical limitations or general assumptions made in the study that the reader should be aware of.
[1] …
[2] …
[3] …